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Abstract 

Sparse coding — that is, modelling data vectors as sparse linear combinations of basis 
elements — is widely used in machine learning, neuroscience, signal processing, and statis- 
tics. This paper focuses on the large-scale matrix factorization problem that consists of 
learning the basis set in order to adapt it to specific data. Variations of this problem in- 
clude dictionary learning in signal processing, non-negative matrix factorization and sparse 
principal component analysis. In this paper, we propose to address these tasks with a 
new online optimization algorithm, based on stochastic approximations, which scales up 
gracefully to large data sets with millions of training samples, and extends naturally to 
various matrix factorization formulations, making it suitable for a wide range of learning 
problems. A proof of convergence is presented, along with experiments with natural images 
and genomic data demonstrating that it leads to state-of-the-art performance in terms of 
speed and optimization for both small and large data sets. 

Keywords: basis pursuit, dictionary learning, matrix factorization, online learning, 
sparse coding, sparse principal component analysis, stochastic approximations, stochas- 
tic optimization, non-negative matrix factorization 



1. Introduction 

The linear decomposition of a signal using a few atoms of a learned dictionary instead of 
a predefined one — based on wavelets (Mallat, 1999) for example — has recently led to state- 
of-the-art results in numerous low-level signal processing tasks such as image denoising 
(Elad and Aharon, 2006; Mairal et al., 2008b), texture synthesis (Peyre, 2009) and audio 
processing (Grosse et al., 2007; Fevotte et al., 2009; Zibulevsky and Pearlmutter, 2001), as 
well as higher-level tasks such as image classification (Raina et al., 2007; Mairal et al., 2008a, 
2009b; Bradley and Bagnell, 2009; Yang et al., 2009), showing that sparse learned models 
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are well adapted to natural signals. Unlike decompositions based on principal component 
analysis and its variants, these models do not impose that the basis vectors be orthogonal, 
allowing more flexibility to adapt the representation to the data. 1 In machine learning and 
statistics, slightly different matrix factorization problems are formulated in order to obtain 
a few interpretable basis elements from a set of data vectors. This includes non-negative 
matrix factorization and its variants (Lee and Seung, 2001; Hoyer, 2002, 2004; Lin, 2007), 
and sparse principal component analysis (Zou et al., 2006; d'Aspremont et al., 2007, 2008; 
Witten et al., 2009; Zass and Shashua, 2007). As shown in this paper, these problems 
have strong similarities; even though we first focus on the problem of dictionary learning, 
the algorithm we propose is able to address all of them. While learning the dictionary 
has proven to be critical to achieve (or improve upon) state-of-the-art results in signal and 
image processing, effectively solving the corresponding optimization problem is a significant 
computational challenge, particularly in the context of large-scale data sets that may include 
millions of training samples. Addressing this challenge and designing a generic algorithm 
which is capable of efficiently handling various matrix factorization problems, is the topic 
of this paper. 

Concretely, consider a signal x in M. m . We say that it admits a sparse approximation 
over a dictionary D in M. mxk , with k columns referred to as atoms, when one can find a 
linear combination of a "few" atoms from D that is "close" to the signal x. Experiments 
have shown that modelling a signal with such a sparse decomposition (sparse coding) is very 
effective in many signal processing applications (Chen et al., 1999). For natural images, 
predefined dictionaries based on various types of wavelets (Mallat, 1999) have also been 
used for this task. However, learning the dictionary instead of using off-the-shelf bases 
has been shown to dramatically improve signal reconstruction (Elad and Aharon, 2006). 
Although some of the learned dictionary elements may sometimes "look like" wavelets (or 
Gabor filters), they are tuned to the input images or signals, leading to much better results 
in practice. 

Most recent algorithms for dictionary learning (Olshausen and Field, 1997; Engan et al., 
1999; Lewicki and Sejnowski, 2000; Aharon et al., 2006; Lee et al., 2007) are iterative batch 
procedures, accessing the whole training set at each iteration in order to minimize a cost 
function under some constraints, and cannot efficiently deal with very large training sets 
(Bottou and Bousquet, 2008), or dynamic training data changing over time, such as video 
sequences. To address these issues, we propose an online approach that processes the signals, 
one at a time, or in mini-batches. This is particularly important in the context of image and 
video processing (Protter and Elad, 2009; Mairal et al., 2008c), where it is common to learn 
dictionaries adapted to small patches, with training data that may include several millions 
of these patches (roughly one per pixel and per frame). In this setting, online techniques 
based on stochastic approximations are an attractive alternative to batch methods (see, e.g., 
Bottou, 1998; Kushner and Yin, 2003; Shalev-Shwartz et al., 2009). For example, first-order 
stochastic gradient descent with projections on the constraint set (Kushner and Yin, 2003) 
is sometimes used for dictionary learning (see Aharon and Elad, 2008; Kavukcuoglu et al., 
2008 for instance). We show in this paper that it is possible to go further and exploit the 

1. Note that the terminology "basis" is slightly abusive here since the elements of the dictionary are not 
necessarily linearly independent and the set can be overcomplete — that is, have more elements than the 
signal dimension. 
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specific structure of sparse coding in the design of an optimization procedure tuned to this 
problem, with low memory consumption and lower computational cost than classical batch 
algorithms. As demonstrated by our experiments, it scales up gracefully to large data sets 
with millions of training samples, is easy to use, and is faster than competitive methods. 

The paper is structured as follows: Section 2 presents the dictionary learning problem. 
The proposed method is introduced in Section 3, with a proof of convergence in Section 4. 
Section 5 extends our algorithm to various matrix factorization problems that generalize 
dictionary learning, and Section 6 is devoted to experimental results, demonstrating that 
our algorithm is suited to a wide class of learning problems. 

1.1 Contributions 

This paper makes four main contributions: 

• We cast in Section 2 the dictionary learning problem as the optimization of a smooth 
nonconvex objective function over a convex set, minimizing the (desired) expected cost 
when the training set size goes to infinity, and propose in Section 3 an iterative online 
algorithm that solves this problem by efficiently minimizing at each step a quadratic 
surrogate function of the empirical cost over the set of constraints. This method is 
shown in Section 4 to converge almost surely to a stationary point of the objective 
function. 

• As shown experimentally in Section 6, our algorithm is significantly faster than pre- 
vious approaches to dictionary learning on both small and large data sets of natural 
images. To demonstrate that it is adapted to difficult, large-scale image-processing 
tasks, we learn a dictionary on a 12-Megapixel photograph and use it for inpainting — 
that is, filling some holes in the image. 

• We show in Sections 5 and 6 that our approach is suitable to large-scale matrix 
factorization problems such as non-negative matrix factorization and sparse principal 
component analysis, while being still effective on small data sets. 

• To extend our algorithm to several matrix factorization problems, we propose in Ap- 
pendix B efficient procedures for projecting onto two convex sets, which can be useful 
for other applications that are beyond the scope of this paper. 

1.2 Notation 

We define for p > 1 the £ p norm of a vector x in lR m as ||x|| p = (X^i I x [*]| p ) 1/ ' P j where x[i] 
denotes the i-th coordinate of x and HxH^ = maxj = i v .. jm |x[z]| = linip^oo ||x|| p . We also 
define the £q pseudo-norm as the sparsity measure which counts the number of nonzero 
elements in a vector: 2 ||x||o = #{« s.t. x[i] / 0} = lim p _ >0 +(^™ 1 |x[i]| p ). We denote the 

Frobenius norm of a matrix X in M mx ™ by ||X|| F = (££i X)?=i X M 2 ) 1/2 - For a sequence 
of vectors (or matrices) xt and scalars ut, we write x^ = O(ut) when there exists a constant 
K > so that for all t, \\x.t\\2 < Kut- Note that for finite-dimensional vector spaces, the 

2. Note that it would be more proper to write ||x||o instead of ||x||o to be consistent with the traditional 
notation ||x|| p . However, for the sake of simplicity, we will keep this notation unchanged. 
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choice of norm is essentially irrelevant (all norms are equivalent). Given two matrices A 
in M m i xra i and B in W n2Xn2 , A £3 B denotes the Kronecker product between A and B, 
defined as the matrix in K m i m 2 xn i"2 ] defined by blocks of sizes x ri2 equal to A[i, j]B. 
For more details and properties of the Kronecker product, see Golub and Van Loan (1996), 
and Magnus and Neudecker (1999). 

2. Problem Statement 

Classical dictionary learning techniques for sparse representation (Olshausen and Field, 
1997; Engan et al., 1999; Lewicki and Sejnowski, 2000; Aharon et al., 2006; Lee et al., 2007) 
consider a finite training set of signals X = [xi, . . . , x n ] in M mxn and optimize the empirical 
cost function 

1 n 

/ n (D)4_^( Xl ,D), (1) 

i=i 

where D in M. mxk i s the dictionary, each column representing a basis vector, and £ is a loss 
function such that £(x, D) should be small if D is "good" at representing the signal x in 
a sparse fashion. The number of samples n is usually large, whereas the signal dimension 
m is relatively small, for example, m = 100 for 10 x 10 image patches, and n > 100, 000 
for typical image processing applications. In general, we also have fc«n (e.g., k = 200 for 
n = 100,000), but each signal only uses a few elements of D in its representation, say 10 
for instance. Note that, in this setting, overcomplete dictionaries with k > m are allowed. 
As others (see for example Lee et al., 2007), we define £(x, D) as the optimal value of the 
£\ sparse coding problem: 

£(x,D) = min -||x - Da||| + A||a||i, (2) 
c*eR fc 2 

where A is a regularization parameter. This problem is also known as basis pursuit (Chen 
et al., 1999), or the Lasso (Tibshirani, 1996). 3 It is well known that t\ regularization yields 
a sparse solution for a, but there is no direct analytic link between the value of A and the 
corresponding effective sparsity 1 1 ck 1 1 o - To prevent D from having arbitrarily large values 
(which would lead to arbitrarily small values of a), it is common to constrain its columns 
di, . . . , dfc to have an ^2-norm less than or equal to one. We will call C the convex set of 
matrices verifying this constraint: 

C = {D G R mxk s.t. Vj = 1, . . . , k, djdj < 1}. 

Note that the problem of minimizing the empirical cost / n (D) is not convex with respect 
to D. It can be rewritten as a joint optimization problem with respect to the dictionary D 

3. To be more precise, the original formulation of the Lasso is a constrained version of Eq. (2), with a 
constraint on the £i-norm of a: 

min i||x-Da||! s.t. ||a||i < T. (3) 

a£M k I 

Both formulations are equivalent in the sense that for every A > (respectively every T > 0), there 
exists a scalar T (respectively A) so that Equations (2) and (3) admit the same solutions. 
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and the coefficients a = [ati, . . . ,a n ] in M. kxn of the sparse decompositions, which is not 
jointly convex, but convex with respect to each of the two variables D and a when the 
other one is fixed: 

n ^ 

min Yl ( oH X * ~ DQ; ill2 + A||0£i| |i ) . (4) 

t=i 

This can be rewritten as a matrix factorization problem with a sparsity penalty: 

1 

DeC,aeR fcxn 2 



iiii ii ^||X — Da||f. + A| |ck| | i,i , 



where, as before, X = [xi,...,x n ] is the matrix of data vectors, and ||c»!||i,i denotes the 
l\ norm of the matrix a — that is, the sum of the magnitude of its coefficients. A natural 
approach to solving this problem is to alternate between the two variables, minimizing over 
one while keeping the other one fixed, as proposed by Lee et al. (2007) (see also Engan et al. 
1999 and Aharon et al. 2006, who use £q rather than l\ penalties, or Zou et al. 2006 for the 
problem of sparse principal component analysis). 4 Since the computation of the coefficients 
vectors ol l dominates the cost of each iteration in this block-coordinate descent approach, 
a second-order optimization technique can be used to accurately estimate D at each step 
when a is fixed. 

As pointed out by Bottou and Bousquet (2008), however, one is usually not interested 
in the minimization of the empirical cost f n (D) with high precision, but instead in the 
minimization of the expected cost 

/(D) = ExWx,D)] = lim / n (D) a.s., 

71— ¥00 

where the expectation (which is supposed finite) is taken relative to the (unknown) prob- 
ability distribution p(x) of the data. 5 In particular, given a finite training set, one should 
not spend too much effort on accurately minimizing the empirical cost, since it is only an 
approximation of the expected cost. An "inaccurate" solution may indeed have the same or 
better expected cost than a "well-optimized" one. Bottou and Bousquet (2008) further show 
that stochastic gradient algorithms, whose rate of convergence is very poor in conventional 
optimization terms, may in fact in certain settings be shown both theoretically and empir- 
ically to be faster in reaching a solution with low expected cost than second-order batch 
methods. With large training sets, the risk of overfitting is lower, but classical optimization 
techniques may become impractical in terms of speed or memory requirements. 

In the case of dictionary learning, the classical projected first-order projected stochastic 
gradient descent algorithm (as used by Aharon and Elad 2008; Kavukcuoglu et al. 2008 for 
instance) consists of a sequence of updates of D: 



D t = n c 



D t -i-<W D £(x t ,D t _ 1 ; 



where is the estimate of the optimal dictionary at iteration t, 5t is the gradient step, 
ILj is the orthogonal projector onto C, and the vectors x t are i.i.d. samples of the (un- 
known) distribution p(x). Even though it is often difficult to obtain such i.i.d. samples, 



4. In our setting, as in Lee et al. (2007), we have preferred to use the convex l\ norm, that has empirically 
proven to be better behaved in general than the Iq pseudo-norm for dictionary learning. 

5. We use "a.s." to denote almost sure convergence. 
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the vectors x< are in practice obtained by cycling on a randomly permuted training set. As 
shown in Section 6, we have observed that this method can be competitive in terms of speed 
compared to batch methods when the training set is large and when 5 t is carefully chosen. 
In particular, good results are obtained using a learning rate of the form St = a/(t + b), 
where a and b have to be well chosen in a data set-dependent way. Note that first-order 
stochastic gradient descent has also been used for other matrix factorization problems (see 
Koren et al., 2009 and references therein). 

The optimization method we present in the next section falls into the class of online 
algorithms based on stochastic approximations, processing one sample at a time (or a mini- 
batch), but further exploits the specific structure of the problem to efficiently solve it by 
sequentially minimizing a quadratic local surrogate of the expected cost. As shown in 
Section 3.5, it uses second-order information of the cost function, allowing the optimization 
without any explicit learning rate tuning. 

3. Online Dictionary Learning 

We present in this section the basic components of our online algorithm for dictionary learn- 
ing (Sections 3.1-3.3), as well as a few minor variants which speed up our implementation in 
practice (Section 3.4) and an interpretation in terms of a Kalman algorithm (Section 3.5). 

3.1 Algorithm Outline 

Our procedure is summarized in Algorithm 1. Assuming that the training set is composed 
of i.i.d. samples of a distribution p(x), its inner loop draws one element x t at a time, as in 
stochastic gradient descent, and alternates classical sparse coding steps for computing the 
decomposition on of x< over the dictionary T)t-i obtained at the previous iteration, with 
dictionary update steps where the new dictionary D t is computed by minimizing over C the 
function 

A(D) = |X) -Dot^li + AHoEilu), (5) 

and the vectors on for i < t have been computed during the previous steps of the algorithm. 
The motivation behind this approach is twofold: 

• The function f t , which is quadratic in D, aggregates the past information with a few 
sufficient statistics obtained during the previous steps of the algorithm, namely the 
vectors ctj, and it is easy to show that it upperbounds the empirical cost ft(Dt) from 
Eq. (1). One key aspect of our convergence analysis will be to show that /t(D t ) and 
ft(D t ) converge almost surely to the same limit, and thus that ft acts as a surrogate 
for ft- 

• Since ft is close to ft-\ for large values of t, so are and Dj_i, under suitable 
assumptions, which makes it efficient to use D(_i as warm restart for computing D^. 
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Algorithm 1 Online dictionary learning. 

Require: x G M m ~ p(x) (random variable and an algorithm to draw i.i.d samples of p), 
A £ K (regularization parameter), Do G K mxfc (initial dictionary), T (number of itera- 
tions) . 

1: A G R kxk <- 0, B G K mxfe <- (reset the "past" information). 

2: for f = 1 to T do 

3: Draw xj from p(x). 

4: Sparse coding: compute using LARS 

a t = argmini||x t - D t _ia:||| + A||a||i. 

a£R k ^ 

5: A t 4- A t _i + a t af . 

6: B t <- B t _i + X t £*f . 

7: Compute D t using Algorithm 2, with D t _x as warm restart, so that 

1 * 1 

D t = argmin- V (-||xj - Do;i||2 + A||Qi||i), 
Dec t f-^ \2 / 

= argminViTr(D T DA t )-TV(D T B t )). (6) 
Dec * v ^ / 

8: end for 

9: return (learned dictionary). 



Algorithm 2 Dictionary Update. 



Require: D = [di, . . . , d k ] G R mxfe (input dictionary), 
A= [ai,...,a fc ] GM fcxfc 
B = [bi,...,b fc ] GR mxfe 
repeat 

for j = 1 to fc do 

Update the j-th column to optimize for (6): 



-u,. 



max(||uj|| 2 , 1) 



4: end for 

5: until convergence 

6: return D (updated dictionary). 



3.2 Sparse Coding 

The sparse coding problem of Eq. (2) with fixed dictionary is an ^-regularized linear least- 
squares problem. A number of recent methods for solving this type of problems are based 
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on coordinate descent with soft thresholding (Fu, 1998; Friedman et al., 2007; Wu and 
Lange, 2008). When the columns of the dictionary have low correlation, we have observed 
that these simple methods are very efficient. However, the columns of learned dictionaries 
are in general highly correlated, and we have empirically observed that these algorithms 
become much slower in this setting. This has led us to use instead the LARS-Lasso al- 
gorithm, a homotopy method (Osborne et al., 2000; Efron et al., 2004) that provides the 
whole regularization path — that is, the solutions for all possible values of A. With an ef- 
ficient Cholesky-based implementation (see Efron et al., 2004; Zou and Hastie, 2005) for 
brief descriptions of such implementations) , it has proven experimentally at least as fast as 
approaches based on soft thresholding, while providing the solution with a higher accuracy 
and being more robust as well since it does not require an arbitrary stopping criterion. 

3.3 Dictionary Update 

Our algorithm for updating the dictionary uses block-coordinate descent with warm restarts 
(see Bertsekas, 1999). One of its main advantages is that it is parameter free and does not 
require any learning rate tuning. Moreover, the procedure does not require to store all the 
vectors Xj and ctj, but only the matrices A t = Yll=i a i a J m K fcxfc and B t = Yll=i x « Q f i* 1 
R mxk . Concretely, Algorithm 2 sequentially updates each column of D. A simple calculation 
shows that solving (6) with respect to the j-th column dj, while keeping the other ones fixed 
under the constraint djdj < 1, amounts to an orthogonal projection of the vector Uj defined 
in Eq. (7), onto the constraint set, namely the ^-ball here, which is solved by Eq. (7). Since 
the convex optimization problem (6) admits separable constraints in the updated blocks 
(columns), convergence to a global optimum is guaranteed (Bertsekas, 1999). In practice, 
the vectors on are sparse and the coefficients of the matrix are often concentrated on the 
diagonal, which makes the block-coordinate descent more efficient. 6 After a few iterations 
of our algorithm, using the value of D t _i as a warm restart for computing D t becomes 
effective, and a single iteration of Algorithm 2 has empirically found to be sufficient to 
achieve convergence of the dictionary update step. Other approaches have been proposed 
to update D: For instance, Lee et al. (2007) suggest using a Newton method on the dual 
of Eq. (6), but this requires inverting a k x k matrix at each Newton iteration, which is 
impractical for an online algorithm. 

3.4 Optimizing the Algorithm 

We have presented so far the basic building blocks of our algorithm. This section discusses 
a few simple improvements that significantly enhance its performance. 

3.4.1 Handling Fixed-Size Data Sets 

In practice, although it may be very large, the size of the training set often has a predefined 
finite size (of course this may not be the case when the data must be treated on the fly 
like a video stream for example). In this situation, the same data points may be examined 

6. We have observed that this is true when the columns of D are not too correlated. When a group 
of columns in D are highly correlated, the coefficients of the matrix A t concentrate instead on the 
corresponding principal submatrices of A t . 



8 



several times, and it is very common in online algorithms to simulate an i.i.d. sampling of 
p(x) by cycling over a randomly permuted training set (see Bottou and Bousquet, 2008 and 
references therein). This method works experimentally well in our setting but, when the 
training set is small enough, it is possible to further speed up convergence: In Algorithm 1, 
the matrices A t and B t carry all the information from the past coefficients ati,...,a t . 
Suppose that at time to, a signal x is drawn and the vector att is computed. If the same 
signal x is drawn again at time t > to, then it is natural to replace the "old" information a to 
by the new vector on in the matrices At and B t — that is, At <— Af_i + a t a[ — ctt af Q and 
Bt «— Bt_i + *-t a J ~ yi t OL 't - I n this setting, which requires storing all the past coefficients 
o.t , this method amounts to a block-coordinate descent for the problem of minimizing 
Eq. (4). When dealing with large but finite sized training sets, storing all coefficients an 
is impractical, but it is still possible to partially exploit the same idea, by removing the 
information from At and Bt that is older than two epochs (cycles through the data) , through 
the use of two auxiliary matrices A^ and B£ of size k x k and m x k respectively. These 
two matrices should be built with the same rules as A t and B t , except that at the end of 
an epoch, A t and B t are respectively replaced by A' t and Bj, while A' t and B£ are set to 0. 
Thanks to this strategy, At and B t do not carry any coefficients an older than two epochs. 

3.4.2 Scaling the "Past" Data 

At each iteration, the "new" information ait that is added to the matrices At and Bt has 
the same weight as the "old" one. A simple and natural modification to the algorithm 
is to rescale the "old" information so that newer coefficients ct t have more weight, which 
is classical in online learning. For instance, Neal and Hinton (1998) present an online 
algorithm for EM, where sufficient statistics are aggregated over time, and an exponential 
decay is used to forget out-of-date statistics. In this paper, we propose to replace lines 5 
and 6 of Algorithm 1 by 

At <- /3 t A t -i + a t otJ, 
Bt <- /3tB t _i + x t af , 

where f3 t = (l — j) p , and p is a new parameter. In practice, one can apply this strategy 
after a few iterations, once At is well-conditioned. Tuning p improves the convergence rate, 
when the training sets are large, even though, as shown in Section 6, it is not critical. To 
understand better the effect of this modification, note that Eq. (6) becomes 

1 * ' p 1 

D * ^ ^" EUCiT^ S © P (2 l|Xl " Dai ^ + A||ai111 )' 

= arg min ) (\ Tr(D T DAf) - Tr(D r B t )) . 

Dec Ej=iOA) pV2 ' 

When p = 0, we obtain the original version of the algorithm. Of course, different strategies 
and heuristics could also be investigated. In practice, this parameter p is useful for large data 
sets only (n > 100 000). For smaller data sets, we have not observed a better performance 
when using this extension. 
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3.4.3 Mini-Batch Extension 



In practice, we can also improve the convergence speed of our algorithm by drawing r/ > 1 
signals at each iteration instead of a single one, which is a classical heuristic in stochastic 
gradient descent algorithms. In our case, this is further motivated by the fact that the 
complexity of computing 77 vectors on is not linear in rj. A Cholesky-based implementation 
of LARS-Lasso for decomposing a single signal has a complexity of 0(kms + ks 2 ), where 
s is the number of nonzero coefficients. When decomposing rj signals, it is possible to pre- 
compute the Gram matrix D^D^ and the total complexity becomes 0(k 2 m + r](km + ks 2 )), 
which is much cheaper than rj times the previous complexity when ij is large enough and s 
is small. Let us denote by x t) i, . . . , x tjJ? the signals drawn at iteration t. We can now replace 
lines 5 and 6 of Algorithm 1 by 



3.4.4 Slowing Down the First Iterations 

As in the case of stochastic gradient descent, the first iterations of our algorithm may update 
the parameters with large steps, immediately leading to large deviations from the initial 
dictionary. To prevent this phenomenon, classical implementations of stochastic gradient 
descent use gradient steps of the form a/(t + b), where b "reduces" the step size. An 
initialization of the form Ao = iol an d Bo = toL>o with to > also slows down the first 
steps of our algorithm by forcing the solution of the dictionary update to stay close to Do- 
As shown in Section 6, we have observed that our method does not require this extension 
to achieve good results in general. 

3.4.5 Purging the Dictionary from Unused Atoms 

Every dictionary learning technique sometimes encounters situations where some of the 
dictionary atoms are never (or very seldom) used, which typically happens with a very bad 
initialization. A common practice is to replace these during the optimization by randomly 
chosen elements of the training set, which solves in practice the problem in most cases. For 
more difficult and highly regularized cases, it is also possible to choose a continuation strat- 
egy consisting of starting from an easier, less regularized problem, and gradually increasing 
A. This continuation method has not been used in this paper. 

3.5 Link with Second-order Stochastic Gradient Descent 

For unconstrained learning problems with twice differentiable expected cost, the second- 
order stochastic gradient descent algorithm (see Bottou and Bousquet, 2008 and references 
therein) improves upon its first-order version, by replacing the learning rate by the inverse 
of the Hessian. When this matrix can be computed or approximated efficiently, this method 
usually yields a faster convergence speed and removes the problem of tuning the learning 
rate. However, it cannot be applied easily to constrained optimization problems and requires 
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at every iteration an inverse of the Hessian. For these two reasons, it cannot be used for the 
dictionary learning problem, but nevertheless it shares some similarities with our algorithm, 
which we illustrate with the example of a different problem. 

Suppose that two major modifications are brought to our original formulation: (i) the 
vectors a. t are independent of the dictionary D — that is, they are drawn at the same time 
as x t ; (ii) the optimization is unconstrained — that is, C = W nxk . This setting leads to the 
least-square estimation problem 

min E (X)a) [||x-Da||l], (8) 

which is of course different from the original dictionary learning formulation. Nonetheless, 
it is possible to address Eq. (8) with our method and show that it amounts to using the 
recursive formula 

t _ 1 

T> t <- D t _i + (x t - D t _ia t )af ( a^af ) , 

which is equivalent to a second-order stochastic gradient descent algorithm: The gradient 
obtained at (x^o^) is the term — (x t — D t _ia t )aJ, and the sequence (1/t) ^2 i=1 oncxj 
converges to the Hessian of the objective function. Such sequence of updates admit a 
fast implementation called Kalman algorithm (see Kushner and Yin, 2003 and references 
therein). 



4. Convergence Analysis 

The main tools used in our proofs are the convergence of empirical processes (Van der Vaart, 
1998) and, following Bottou (1998), the convergence of quasi-martingales (Fisk, 1965). Our 
analysis is limited to the basic version of the algorithm, although it can in principle be 
carried over to the optimized versions discussed in Section 3.4. Before proving our main 
result, let us first discuss the (reasonable) assumptions under which our analysis holds. 



4.1 Assumptions 

(A) The data admits a distribution with compact support K. Assuming a compact 
support for the data is natural in audio, image, and video processing applications, where it 
is imposed by the data acquisition process. 

(B) The quadratic surrogate functions f t are strictly convex with lower-bounded 
Hessians. We assume that the smallest eigenvalue of the positive semi-definite matrix j At 
defined in Algorithm 1 is greater than or equal to some constant n\. As a consequence, At 
is invertible and ft is strictly convex with Hessian I <g) | A t . This hypothesis is in practice 
verified experimentally after a few iterations of the algorithm when the initial dictionary is 
reasonable, consisting for example of a few elements from the training set, or any common 
dictionary, such as DCT (bases of cosines products) or wavelets (Mallat, 1999). Note that 
it is easy to enforce this assumption by adding a term ^||D||^ to the objective function, 
which is equivalent to replacing the positive semi-definite matrix \At by jAt + K\L. We 
have omitted for simplicity this penalization in our analysis. 

(C) A particular sufficient condition for the uniqueness of the sparse coding 
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solution is satisfied. Before presenting this assumption, let us briefly recall classical 
optimality conditions for the t\ decomposition problem in Eq. (2) (Fuchs, 2005). For x 
in K and D in C, a in ]R fc is a solution of Eq. (2) if and only if 



Let a* be such a solution. Denoting by A the set of indices j such that |dj(x — Da*)| = A, 
and Da the matrix composed of the columns from D restricted to the set A, it is easy to 



where a.\ is the vector containing the values of a* corresponding to the set A and £\[j] is 
equal to the sign of a\[j] for all j. With this preliminary uniqueness condition in hand, 
we can now formulate our assumption: We assume that there exists K2 > such that, 
for all x in K and all dictionaries D in the subset of C considered by our algorithm, the 
smallest eigenvalue o/D^Da is greater than or equal to k 2 . This guarantees the invertibility 
of (D^Da) and therefore the uniqueness of the solution of Eq. (2). It is of course easy to 
build a dictionary D for which this assumption fails. However, having D^Da invertible is a 
common assumption in linear regression and in methods such as the LARS algorithm aimed 
at solving Eq. (2) (Efron et al., 2004). It is also possible to enforce this condition using 
an elastic net penalization (Zou and Hastie, 2005), replacing ||o:||i by ||o:||i + ^f||o:||2 and 
thus improving the numerical stability of homotopy algorithms, which is the choice made 
by Zou et al. (2006). Again, we have omitted this penalization in our analysis. 

4.2 Main Results 

Given assumptions (A)-(C), let us now show that our algorithm converges to a stationary 
point of the objective function. Since this paper is dealing with non-convex optimization, 
neither our algorithm nor any one in the literature is guaranteed to find the global optimum 
of the optimization problem. However, such stationary points have often been found to be 
empirically good enough for practical applications, for example, for image restoration (Elad 
and Aharon, 2006; Mairal et al., 2008b). 

Our first result (Proposition 1 below) states that given (A)-(C), f(Dt) converges almost 
surely and f(D t ) — /t(D t ) converges almost surely to 0, meaning that ft acts as a converging 
surrogate of /. First, we prove a lemma to show that D t — T)t~i = 0(l/i). It does 
not ensure the convergence of D^, but guarantees the convergence of the positive sum 
J2t^i — Di_i|||,, a classical condition in gradient descent convergence proofs (Bertsekas, 
1999). 

Lemma 1 [Asymptotic variations of D t ]. 

Assume (A) -(C). Then, 



Proof. This proof is inspired by Prop 4.32 of Bonnans and Shapiro (2000) on the Lipschitz 
regularity of solutions of optimization problems. Using assumption (B), for all t, the 




(9) 
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surrogate ft is strictly convex with a Hessian lower-bounded by n±. Then, a short calculation 
shows that it verifies the second- order growth condition 

/t(D t+ i) - / t (D t ) > Ki||D t+ i - B t \\ 2 F . (11) 

Moreover, 

/ t (D t+1 ) - / t (D t ) = /t(D t+1 ) - / t+ i(D t+1 ) + ; t+ i(D t+1 ) - ; t+ i(D t ) + ; m (D t ) - / t (D t ) 
< ; t (D t+1 ) - ; m (D m ) + ; t+1 (D t ) - ; t (D t ), 

where we have used that / t+1 (D t+1 ) — f t+ i(D t ) < because D t+1 minimizes on C. 
Since / t (D) = Tr(D T DA t ) - Tr(D T B t )), and ||D|| F < Vk, it is possible to show that 
ft — ft+i is Lipschitz with constant q = (l/t)(||B t+ i — B t ||p + \/A;||A t+ i — A t ||ir), which 
gives 

/t(D t+ i) - / t (D t ) < Q||D m - D t || F . (12) 
From Eq. (11) and (12), we obtain 

||D t+ i-D t || F < — . 

Assumptions (A), (C) and Eq. (10) ensure that the vectors ai and Xj are bounded with 
probability one and therefore q = 0(1/ 1) a.s. ■ 

We can now state and prove our first proposition, which shows that we are indeed 
minimizing a smooth function. 

Proposition 1 [Regularity of /]. 

Assume (A) to (C). For x in the support K of the probability distribution p, and D in the 
feasible set C, let us define 

a*(x, D) = argmin -||x - Da||| + A||a||i. (13) 

aSM fe ^ 

Then, 

1. the function £ defined in Eq. (2) is continuously differentiate and 

V D ^(x, D) = -(x - Da*(x, D))a*(x, D) T . 

2. f is continuously differ entiable and V/(D) = E x [Vd^(x, D)] ; 

3. V/(D) is Lipschitz on C. 

Proof. Assumption (A) ensures that the vectors a* are bounded for x in K and D in C. 
Therefore, one can restrict the optimization problem (13) to a compact subset of Under 
assumption (C), the solution of Eq. (13) is unique and a* is well-defined. Theorem 1 in 
Appendix A from Bonnans and Shapiro (1998) can be applied and gives us directly the first 
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statement. Since K is compact, and t is continuously differ entiable, the second statement 
follows immediately. 

To prove the third claim, we will show that for all x in K, ct*(x, .) is Lipschitz with a 
constant independent of x, 7 which is a sufficient condition for V/ to be Lipschitz. First, 
the function optimized in Eq. (13) is continuous in a, D, x and has a unique minimum, 
implying that a* is continuous in x and D. 

Consider a matrix D in C and x in K and denote by a* the vector cc*(x, D), and again 
by A the set of indices j such that |dj(x — Da*)| = A. Since dj(x — Da*) is continuous 
in D and x, there exists an open neighborhood V around (x, D) such that for all (x',D') 
in V, and j <£ A, |dj'(x' - D'a*')| < A and a*'\j] = 0, where a*' = c**(x',D'). 

Denoting by Ua the matrix composed of the columns of a matrix U corresponding 
to the index set A and similarly by ua the vector composed of the values of a vector u 
corresponding to A, we consider the function £ 

^(x,D a ,Q!a) = — DackaIII + A| |ck a | |i, 

Assumption (C) tells us that ^(x, Da,.) is strictly convex with a Hessian lower-bounded 
by /«2- Let us consider (x',D') in V. A simple calculation shows that 

£(x,D A ,a A ') -^(x,D A ,a A ) > rc 2 ||a A ' ~ "aIII- 

Moreover, it is easy to show that £(x, Da, .)— €(x', D' a , .) is Lipschitz with constant ei||D A — 
D A ||ir + e2||x — x' 1 1 2 , where ei,e2 are constants independent of D,D',x, x' and then, one 
can show that 

\\a*' - a*\\ 2 = \\a\' - a* A \\ 2 < — (ei||D - D'|| F + e 2 ||x - x'^). 

Therefore, a* is locally Lipschitz. Since K x C is compact, a* is uniformly Lipschitz on 
K x C, which concludes the proof. ■ 



Now that we have shown that / is a smooth function, we can state our first result 
showing that the sequence of functions ft acts asymptotically as a surrogate of / and that 
f(Dt) converges almost surely in the following proposition. 

Proposition 2 [Convergence of f(Dt) and of the surrogate function]. Let ft denote 
the surrogate function defined in Eq. (5). Assume (A) to (C). Then, 

1. ft(Dt) converges almost surely; 

2. /(D t ) — ft(Dt) converges almost surely to 0; 

3. /(Dt) converges almost surely. 

7. From now on, for a vector x in R m , a*(x, .) denotes the function that associates to a matrix D verifying 
Assumption (C), the optimal solution a*(x, D). For simplicity, we will use these slightly abusive notation 
in the rest of the paper. 
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Proof. Part of this proof is inspired by Bottou (1998). We prove the convergence of the 
sequence ft(D t ) by showing that the stochastic positive process 

u t = f t (Pt) > 0, 

is a quasi-martingale and use Theorem 2 from Fisk (1965) (see Appendix A), which states 
that if the sum of the "positive" variations of u t are bounded, ut is a quasi-martingale, which 
converges with probability one (see Theorem 2 for details). Computing the variations of u t , 
we obtain 



ut+i -u t = / t+ i(D t+ i) - f t (D t ) 

= / t+ i(D t+ i) - / t+ i(D t ) + / t+ i(D t ) - / t (D t ) 

i rn , i rrn , l(x t+1 ,D t )-/ t (D t ) / t (D t )-/ f (D t 
= Jt+l(L>t+l) ~ Jt+l(L> t ) H — — h 



(14) 



t + l t + l 



using the fact that / t+ i(D t ) = T ^ I £(x t+ i,~D t ) + j^iftiPt)- Since D t+ i minimizes / t+ i on C 
and isinC, ft+i(Tit+i)~ft+i(J^t) < 0. Since the surrogate upperbounds the empirical 
cost ft, we also have ft(D t ) — /t(D t ) < 0. To use Theorem 2, we consider the filtration of 
the past information Ft and take the expectation of Eq. (14) conditioned on Ft, obtaining 
the following bound 

E| „ l+1 - UlK1 <5!^«. D «)ra-« D «) 



t+l 
f(V t ) - fj(Pt) 
t + l 

11/ ~ ft\\oo 

t + l ' 



For a specific matrix D, the central-limit theorem states that E[\Zi(f(Dt) — ft(Dt))] is 
bounded. However, we need here a stronger result on empirical processes to show that 
^[v^H/ — /t 1 1 oo ] is bounded. To do so, we use the Lemma 2 in Appendix A, which is a 
corollary of Donsker theorem (see Van der Vaart, 1998, chap. 19.2). It is easy to show that in 
our case, all the hypotheses are verified, namely, £(x, .) is uniformly Lipschitz and bounded 
since it is continuously differentiable on a compact set, the set C C M. mxk is bounded, and 
E x [£(x, D) 2 ] exists and is uniformly bounded. Therefore, Lemma 2 applies and there exists 
a constant n > such that 

E[E[u t+1 - u t \F t } + ] < 4- 

t2 

Therefore, defining St as in Theorem 2, we have 

oo oo 

Y,m(u t+ i - u t )} = j2 E M u t+i - u t \F t \ + ] < +oo. 

t=\ t=l 
Thus, we can apply Theorem 2, which proves that ut converges almost surely and that 



\E[u t+ i - u t \Ft]\ < +oo a.s. 



t=i 
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Using Eq. (14) we can show that it implies the almost sure convergence of the positive sum 

A /t(D*)-/t(Dt) 

Using Lemma 1 and the fact that the functions ft and ft are bounded and Lipschitz, with a 
constant independent of t, it is easy to show that the hypotheses of Lemma 3 in Appendix 
A are satisfied. Therefore 

/ t (D t ) - / t (D t ) — > a.s. 

Since /t(D t ) converges almost surely, this shows that /t(D t ) converges in probability to the 
same limit. Note that we have in addition \\f t — /||oo — >t->+oo a.s. (see Van der Vaart, 
1998, Theorem 19.4 (Glivenko-Cantelli)). Therefore, 

/(D t )-/t(D t ) — > a.s. 

t—t+co 

and /(Dt) converges almost surely, which proves the second and third points. ■ 

With Proposition 2 in hand, we can now prove our final and strongest result, namely 
that first-order necessary optimality conditions are verified asymptotically with probability 
one. 

Proposition 3 [Convergence to a stationary point]. Under assumptions (A) to (C) ; 
the distance between T) t and the set of stationary points of the dictionary learning problem 
converges almost surely to when t tends to infinity. 

Proof. Since the sequences of matrices A t ,B t are in a compact set, it is possible to 
extract converging subsequences. Let us assume for a moment that these sequences converge 
respectively to two matrices Aoo and Boo- In that case, T) t converges to a matrix Doo in C. 
Let U be a matrix in M mxfc . Since ft upperbounds ft on M. mxk , for all t, 

/*(D t + U) >/ t (D t + U). 

Taking the limit when t tends to infinity, 

jUDoo+U) >f(T>oo + U). 

Let h t > be a sequence that converges to 0. Using a first order Taylor expansion, and 
using the fact that V/ is Lipschitz and /oo(Doo) = /(Doo) a -s., we have 

/(D^) + Tr(^U T V/oo(Doo)) + o(ht\J) > /(Doo) + Tr(/ lt U T V/(D 00 )) + o(h t U), 
and it follows that 

Tr (^-L_U T V/oo(Doo)) > Tr (^_L_U T V/(Doo)) , 

Since this inequality is true for all U, V/oo(Doo) = V/(Doo). A first-order necessary opti- 
mality condition for Doo being an optimum of /oo is that — V/oo is in the normal cone of 
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the set C at Doo (Borwein and Lewis, 2006). Therefore, this first-order necessary conditions 
is verified for / at as well. Since Aj,Bj are asymptotically close to their accumulation 
points, — V/(D t ) is asymptotically close the normal cone at D t and these first-order opti- 
mally conditions are verified asymptotically with probability one. ■ 



5. Extensions to Matrix Factorization 

In this section, we present variations of the basic online algorithm to address different op- 
timization problems. We first present different possible regularization terms for a and D, 
which can be used with our algorithm, and then detail some specific cases such as non- 
negative matrix factorization, sparse principal component analysis, constrained sparse cod- 
ing, and simultaneous sparse coding. 

5.1 Using Different Regularizers for a 

In various applications, different priors for the coefficients a may lead to different regular- 
izers ip(a). As long as the assumptions of Section 4.1 are verified, our algorithm can be 
used with: 

• Positivity constraints on a that are added to the £i-regularization. The homotopy 
method presented in Efron et al. (2004) is able to handle such constraints. 

• The Tikhonov regularization, ip(a) = ^\\a\\2, which does not lead to sparse solutions. 

• The elastic net (Zou and Hastie, 2005), 4>(a) = Ai||a||i + ^||o:|||, leading to a 
formulation relatively close to Zou et al. (2006). 

• The group Lasso (Yuan and Lin, 2006; Turlach et al., 2005; Bach, 2008), ip(a) = 
Ylt=i Il a ill2> where ctj is a vector corresponding to a group of variables. 

Non-convex regularizers such as the £o pseudo-norm, £ p pseudo-norm with p < 1 can be 
used as well. However, as with any classical dictionary learning techniques exploiting non- 
convex regularizers (e.g., Olshausen and Field, 1997; Engan et al., 1999; Aharon et al., 
2006), there is no theoretical convergence results in these cases. Note also that convex 
smooth approximation of sparse regularizers (Bradley and Bagnell, 2009), or structured 
sparsity- inducing regularizers (Jenatton et al., 2009a; Jacob et al., 2009) could be used as 
well even though we have not tested them. 

5.2 Using Different Constraint Sets for D 

In the previous subsection, we have claimed that our algorithm could be used with different 
regularization terms on ex. For the dictionary learning problem, we have considered an £2- 
regularization on D by forcing its columns to have less than unit ^-norm. We have shown 
that with this constraint set, the dictionary update step can be solved efficiently using a 
block-coordinate descent approach. Updating the j-th column of D, when keeping the other 
ones fixed is solved by orthogonally projecting the vector Uj = dj + (1/A[j, j])(hj — Da^) on 
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the constraint set C, which in the classical dictionary learning case amounts to a projection 
of Uj on the ^2-ball. 

It is easy to show that this procedure can be extended to different convex constraint 
sets C as long as the constraints are a union of independent constraints on each column 
of D and the orthogonal projections of the vectors Uj onto the set C can be done efficiently. 
Examples of different sets C that we propose as an alternative to C are 

• The "non-negative" constraints: 

C' = {DeR mxk s.t. Vj = l,...,k, ||d,-|| 2 <l and dj > 0}. 

• The "elastic- net" constraints: 

C' = {Del mxfc s.t. Vj = l,...,k, ||dj||l+7||dj||i < 1}. 

These constraints induce sparsity in the dictionary D (in addition to the sparsity- 
inducing regularizer on the vectors ccj). By analogy with the regularization proposed 
by Zou and Hastie (2005), we call these constraints "elastic-net constraints." Here, 
7 is a new parameter, controlling the sparsity of the dictionary D. Adding a non- 
negativity constraint is also possible in this case. Note that the presence of the £2 
regularization is important here. It has been shown by Bach et al. (2008) that using 
the £i-norm only in such problems lead to trivial solutions when k is large enough. The 
combination of i\ and £2 constraints has also been proposed recently for the problem 
of matrix factorization by Witten et al. (2009), but in a slightly different setting. 

• The "fused lasso" (Tibshirani et al., 2005) constraints. When one is looking for a 
dictionary whose columns are sparse and piecewise-constant, a fused lasso regulariza- 
tion can be used. For a vector u in R m , we consider the £i-norm of the consecutive 
differences of u denoted by 

m 

FL(u)^|u[*]-u[*-l]|, 

i=2 

and define the "fused lasso" constraint set 

C' = {Del raxt s.t. Vj = l,...,k, ||dj||l + 7i||dj||i+TfcFL(dj)<l}. 

This kind of regularization has proven to be useful for exploiting genomic data such 
as CGH arrays (Tibshirani and Wang, 2008). 

In all these settings, replacing the projections of the vectors uj onto the ^-ball by the 
projections onto the new constraints, our algorithm is still guaranteed to converge and find 
a stationary point of the optimization problem. The orthogonal projection onto the "non 
negative" ball is simple (additional thresholding) but the projection onto the two other sets 
is slightly more involved. In Appendix B, we propose two algorithms for efficiently solving 
these problems. The first one is presented in Section B.l and computes the projection of 
a vector onto the elastic-net constraint in linear time, by extending the efficient projection 
onto the £i-ball from Maculan and de Paula (1989) and Duchi et al. (2008). The second 
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one is a homotopy method, which solves the projection on the fused lasso constraint set in 
0(ks), where s is the number of piecewise-constant parts in the solution. This method also 
solves efficiently the fused lasso signal approximation problem presented in Friedman et al. 
(2007): 

min Trllb - ulln + 7i||u||i + 72 FL(u) + 7 3 ||u||n. 

uSR n 2 

Being able to solve this problem efficiently has also numerous applications, which are beyond 
the scope of this paper. For instance, it allows us to use the fast algorithm of Nesterov (2007) 
for solving the more general fused lasso problem (Tibshirani et al., 2005). Note that the 
proposed method could be used as well with more complex constraints for the columns of D, 
which we have not tested in this paper, addressing for instance the problem of structured 
sparse PCA (Jenatton et al., 2009b). 

Now that we have presented a few possible regularizers for en and D, that can be used 
within our algorithm, we focus on a few classical problems which can be formulated as 
dictionary learning problems with specific combinations of such regularizers. 



5.3 Non Negative Matrix Factorization 

Given a matrix X = [xi, . . . , x n ] in M mx ™, Lee and Seung (2001) have proposed the non 
negative matrix factorization problem (NMF), which consists of minimizing the following 
cost 

n 1 

min } -llxj — Dai||| s.t. D > 0, V i, ctj > 0. 

r>ec,am kxn z — 1 2 
i=\ 

With this formulation, the matrix D and the vectors on are forced to have non negative 
components, which leads to sparse solutions. When applied to images, such as faces, Lee 
and Seung (2001) have shown that the learned features are more localized than the ones 
learned with a classical singular value decomposition. As for dictionary learning, classical 
approaches for addressing this problem are batch algorithms, such as the multiplicative 
update rules of Lee and Seung (2001), or the projected gradient descent algorithm of Lin 
(2007). 

Following this line of research, Hoyer (2002, 2004) has proposed non negative sparse 
coding (NNSC), which extends non-negative matrix factorization by adding a sparsity- 
inducing penalty to the objective function to further control the sparsity of the vectors ctf. 



n ^ k 

(-||x i -Da i ||! + \J2 a i[j]) s -t- D>0, Vi, <Xi>0. 



mm 

DeC,aeR kxn . 

1=1 3=1 



When A = 0, this formulation is equivalent to NMF. The only difference with the dictionary 
learning problem is that non- negativity constraints are imposed on D and the vectors ol l . A 
simple modification of our algorithm, presented above, allows us to handle these constraints, 
while guaranteeing to find a stationary point of the optimization problem. Moreover, our 
approach can work in the setting when n is large. 
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5.4 Sparse Principal Component Analysis 

Principal component analysis (PCA) is a classical tool for data analysis, which can be 
interpreted as a method for finding orthogonal directions maximizing the variance of the 
data, or as a low-rank matrix approximation method. Jolliffe et al. (2003), Zou et al. (2006), 
d'Aspremont et al. (2007), d'Aspremont et al. (2008), Witten et al. (2009) and Zass and 
Shashua (2007) have proposed different formulations for sparse principal component analysis 
(SPCA), which extends PCA by estimating sparse vectors maximizing the variance of the 
data, some of these formulations enforcing orthogonality between the sparse components, 
whereas some do not. In this paper, we formulate SPCA as a sparse matrix factorization 
which is equivalent to the dictionary learning problem with eventually sparsity constraints 
on the dictionary — that is, we use the ^i-regularization term for a and the "elastic-net" 
constraint for D (as used in a penalty term by Zou et al. 2006): 

n j 

min Yl ( oH Xi ~ Ba i\\i + M\ a i\h ) s.t. Vj = l,...,/c, Hdjlll + 7||dj||i < 1. 

1=1 

As detailed above, our dictionary update procedure amounts to successive orthogonal pro- 
jection of the vectors Uj on the constraint set. More precisely, the update of dj becomes 

u i <- Al^iJ (bi " Daj) + di ' 

dj argmin ||uj — d||2 s.t. ||d||| + 7||d||i < 1, 

deR m 

which can be solved in linear time using Algorithm 3 presented in Appendix B. In addition 
to that, our SPCA method can be used with fused Lasso constraints as well. 

5.5 Constrained Sparse Coding 

Constrained sparse coding problems are often encountered in the literature, and lead to 
different loss functions such as 

f(x,D) = min ||x-Da||| s.t. ||a||i<T, (15) 

or 

T(x,D) = min ||q||i s.t. ||x-Da|||<e, (16) 

aet* 

where T and e are pre-defined thresholds. Even though these loss functions lead to equiva- 
lent optimization problems in the sense that for given x, D and A, there exist e and T such 
that £(x, D), £'(x, D) and £"(x, D) admit the same solution a*, the problems of learning D 
using £, £' of £" are not equivalent. For instance, using £" has proven experimentally to be 
particularly well adapted to image denoising (Elad and Aharon, 2006; Mairal et al., 2008b). 

For all T, the same analysis as for £ can be carried for £' , and the simple modifica- 
tion which consists of computing a t using Eq. (15) in the sparse coding step leads to the 
minimization of the expected cost minogc E x [£'(x, D)]. 

Handling the case £" is a bit different. We propose to use the same strategy as for £'— 
that is, using our algorithm but computing a t solving Eq. (16). Even though our analysis 
does not apply since we do not have a quadratic surrogate of the expected cost, experimental 
evidence shows that this approach is efficient in practice. 
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5.6 Simultaneous Sparse Coding 

In some situations, the signals Xj are not i.i.d samples of an unknown probability distribu- 
tion, but are structured in groups (which are however independent from each other), and 
one may want to address the problem of simultaneous sparse coding, which appears also 
in the literature under various names such as group sparsity or grouped variable selection 
(Cotter et al., 2005; Turlach et al., 2005; Yuan and Lin, 2006; Obozinski et al., 2009, 2008; 
Zhang et al, 2008; Tropp et al., 2006; Tropp, 2006). Let X = [xi,...,x,] G R mxq be a 
set of signals. Suppose one wants to obtain sparse decompositions of the signals on the 
dictionary D that share the same active set (non-zero coefficients). Let ex = [cx±, . . . , a q ] in 
R kxq be the matrix composed of the coefficients. One way of imposing this joint sparsity 
is to penalize the number of non-zero rows of ex. A classical convex relaxation of this joint 
sparsity measure is to consider the ^i^-norm on the matrix ex 

k 

\\ a \\i,2 = ||o: J ||2, 

where ex? is the j-th row of ex. In that setting, the ^i^-norm of cx is the ^i-norm of the 
^2-norm of the rows of ex. 

The problem of jointly decomposing the signals Xj can be written as a ^i^-sparse de- 
composition problem, which is a subcase of the group Lasso (Turlach et al., 2005; Yuan and 
Lin, 2006; Bach, 2008), by defining the cost function 

£'"(X,D) = min ^||X-Da||! + A||a||i, 2 , 

which can be computed using a block-coordinate descent approach (Friedman et al., 2007) 
or an active set method (Roth and Fischer, 2008). 

Suppose now that we are able to draw groups of signals Xj, i = 1, . . . ,n which have 
bounded size and are independent from each other and identically distributed, one can learn 
an adapted dictionary by solving the optimization problem 

n 

min lim -V /"(Xj,D). 

i=l 

Being able to solve this optimization problem is important for many applications. For in- 
stance, in Mairal et al. (2009c), state-of-the-art results in image denoising and demosaicking 
are achieved with this formulation. The extension of our algorithm to this case is relatively 
easy, computing at each sparse coding step a matrix of coefficients ex, and keeping the 
updates of A t and B t unchanged. 

All of the variants of this section have been implemented. Next section evaluates some 
of them experimentally. An efficient C++ implementation with a Matlab interface of these 
variants is available on the Willow project-team web page http : / / www . di . ens . f r/ willow/ 
SPAMS/. 
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6. Experimental Validation 

In this section, we present experiments on natural images and genomic data to demonstrate 
the efficiency of our method for dictionary learning, non-negative matrix factorization, and 
sparse principal component analysis. 

6.1 Performance Evaluation for Dictionary Learning 

For our experiments, we have randomly selected 1.25 x 10 6 patches from images in the Pascal 
VOC'06 image database (Everingham et al., 2006), which is composed of varied natural 
images; 10 6 of these are kept for training, and the rest for testing. We used these patches to 
create three data sets A, B, and C with increasing patch and dictionary sizes representing 
various settings which are typical in image processing applications: We have centered and 



Data set 


Signal size m 


Nb k of atoms 


Type 


A 


8 x 8 = 64 


256 


b&w 


B 


12 x 12 x 3 = 432 


512 


color 


C 


16 x 16 = 256 


1024 


b&w 



normalized the patches to have unit ^-norm and used the regularization parameter A = 
1.2/^/m in all of our experiments. The Xj^fm term is a classical normalization factor 
(Bickel et al., 2009), and the constant 1.2 has shown to yield about 10 nonzero coefficients 
for data set A and 40 for data sets B and C in these experiments. We have implemented 
the proposed algorithm in C++ with a Matlab interface. All the results presented in 
this section use the refinements from Section 3.4 since this has lead empirically to speed 
improvements. Although our implementation is multithreaded, our experiments have been 
run for simplicity on a single-CPU, single-core 2.66Ghz machine. 

The first parameter to tune is rj, the number of signals drawn at each iteration. Trying 
different powers of 2 for this variable has shown that rj = 512 was a good choice (lowest 
objective function values on the training set — empirically, this setting also yields the lowest 
values on the test set). Even though this parameter is fairly easy to tune since values of 
64, 128, 256 and 1024 have given very similar performances, the difference with the choice 
rj = 1 is significant. 

Our implementation can be used in both the online setting it is intended for, and in 
a regular batch mode where it uses the entire data set at each iteration. We have also 
implemented a first-order stochastic gradient descent algorithm that shares most of its code 
with our algorithm, except for the dictionary update step. This setting allows us to draw 
meaningful comparisons between our algorithm and its batch and stochastic gradient alter- 
natives, which would have been difficult otherwise. For example, comparing our algorithm 
to the Matlab implementation of the batch approach from Lee et al. (2007) developed by 
its authors would have been unfair since our C++ program has a built-in speed advantage. 8 
To measure and compare the performances of the three tested methods, we have plotted 

8. Both LARS and the feature-sign algorithm (Lee et al., 2007) require a large number of low- level operations 
which are not well optimized in Matlab. We have indeed observed that our C++ implementation of LARS 
is up to 50 times faster than the Matlab implementation of the feature-sign algorithm of Lee et al. (2007) 
for our experiments. 
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the value of the objective function on the test set, acting as a surrogate of the expected 
cost, as a function of the corresponding training time. 

6.1.1 Online vs. Batch 

The left column of Figure 1 compares the online and batch settings of our implementation. 
The full training set consists of 10 6 samples. The online version of our algorithm draws 
samples from the entire set, and we have run its batch version on the full data set as well 
as subsets of size 10 4 and 10 5 (see Figure 1). The online setting systematically outperforms 
its batch counterpart for every training set size and desired precision. We use a logarithmic 
scale for the computation time, which shows that in many situations, the difference in 
performance can be dramatic. Similar experiments have given similar results on smaller data 
sets. Our algorithm uses all the speed-ups from Section 3.4. The parameter p was chosen 
by trying the values 0, 5, 10, 15, 20, 25, and to by trying different powers of 10. We have 
selected (to = 0.001, p = 15), which has given the best performance in terms of objective 
function evaluated on the training set for the three data sets. We have plotted three curves 
for our method: OL1 corresponds to the optimal setting (to = 0.001,/? = 15). Even though 
tuning two parameters might seem cumbersome, we have plotted two other curves showing 
that, on the contrary, our method is very easy to use. The curve OL2, corresponding to the 
setting (to = 0.001, p = 10), is very difficult to distinguish from the first curve and we have 
observed a similar behavior with the setting (to = 0.001, p = 20). showing that our method 
is robust to the choice of the parameter p. We have also observed that the parameter p is 
useful for large data sets only. When using smaller ones (n < 100, 000), it did not bring any 
benefit. 

Moreover, the curve OL3 is obtained without using a tuned parameter to — that is, p = 15 
and to = 0, and shows that its influence is very limited since very good results are obtained 
without using it. On the other hand, we have observed that using a parameter to too big, 
could slightly slow down our algorithm during the first epoch (cycle on the training set). 

6.1.2 Comparison with Stochastic Gradient Descent 

Our experiments have shown that obtaining good performance with stochastic gradient 
descent requires using both the mini-batch heuristic and carefully choosing a learning rate 
of the form a/(rjt + b). To give the fairest comparison possible, we have thus optimized 
these parameters. As for our algorithm, sampling n values among powers of 2 (as before) 
has shown that rj = 512 was a good value and gives a significant better performance than 
rj = 1. 

In an earlier version of this work (Mairal et al., 2009a), we have proposed a strategy 
for our method which does not require any parameter tuning except the mini-batch rj and 
compared it with the stochastic gradient descent algorithm (SGD) with a learning rate of 
the form a/(nt). While our method has improved in performance using the new parameter p, 
SGD has also proven to provide much better results when using a learning rate of the form 
a/(r]t + b) instead of a/(r/t), at the cost of an extra parameter b to tune. Using the learning 
rate a/(rjt) with a high value for a results indeed in too large initial steps of the algorithm 
increasing dramatically the value of the objective function, and a small value of a leads to 
bad asymptotic results, while a learning rate of the form a/(r]t + b) is a good compromise. 
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Figure 1: Left: Comparison between our method and the batch approach for dictionary 
learning. Right: Comparison between our method and stochastic gradient de- 
scent. The results are reported for three data sets as a function of computation 
time on a logarithmic scale. Note that the times of computation that are less 
than 0.1s are not reported. See text for details. 
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We have tried different powers of 10 for a and b. First selected the couple (a = 
100, 000, b = 100, 000) and then refined it, trying the values 100, 000 x 2 i for i = -3, . . . , 3. 
Finally, we have selected (a = 200,000,6 = 400,000). As shown on the right column of 
Figure 1, this setting represented by the curve SGI leads to similar results as our method. 
The curve SG2 corresponds to the parameters (a = 400, 000, b = 400, 000) and shows that 
increasing slightly the parameter a makes the curves worse than the others during the first 
iterations (see for instance the curve between 1 and 10 2 seconds for data set A), but still 
lead to good asymptotic results. The curve SG3 corresponds to a situation where a and b 
are slightly too small (a = 50, 000, b = 100, 000). It is as good as SGI for data sets A and B, 
but asymptotically slightly below the others for data set C. All the curves are obtained as 
the average of three experiments with different initializations. Interestingly, even though 
the problem is not convex, the different initializations have led to very similar values of 
the objective function and the variance of the experiments was always insignificant after 10 
seconds of computations. 

6.2 Non Negative Matrix Factorization and Non Negative Sparse Coding 

In this section, we compare our method with the classical algorithm of Lee and Seung 
(2001) for NMF and the non-negative sparse coding algorithm of Hoyer (2002) for NNSC. 
The experiments have been carried out on three data sets with different sizes: 

• Data set D is composed of n = 2,429 face images of size m = 19 x 19 pixels from the 
the MIT-CBCL Face Database #1 (Sung, 1996). 

• Data set E is composed of n = 2, 414 face images of size m = 192 x 168 pixels from 
the Extended Yale B Database (Georghiades et al., 2001; Lee et al., 2005). 

• Data set F is composed of n = 100, 000 natural image patches of size m = 16 x 16 
pixels from the Pascal VOC'06 image database (Everingham et al., 2006). 

We have used the Matlab implementations of NMF and NNSC of P. Hoyer, which are 
freely available at http : //www. cs .Helsinki . f i/u/phoyer/sof tware .html. Even though 
our C++ implementation has a built-in advantage in terms of speed over these Matlab 
implementations, most of the computational time of NMF and NNSC is spent on large 
matrix multiplications, which are typically well optimized in Matlab. All the experiments 
have been run for simplicity on a single-CPU, single-core 2.4GHz machine, without using the 
parameters p and to presented in Section 3.4 — that is, p = and to = 0. As in Section 6.1, 
a minibatch of size i] = 512 is chosen. Following the original experiment of Lee and Seung 
(2001) on data set D, we have chosen to learn k = 49 basis vectors for the face images data 
sets D and E, and we have chosen k = 64 for data set F. Each input vector is normalized 
to have unit ^-norm. 

The experiments we present in this section compare the value of the objective function 
on the data sets obtained with the different algorithms as a function of the computation 
time. Since our algorithm learns the matrix D but does not provide the matrix a, the 
computation times reported for our approach include two steps: First, we run our algorithm 
to obtain D. Second, we run one sparse coding step over all the input vectors to obtain ex. 
Figure 2 presents the results for NMF and NNSC. The gradient step for the algorithm of 
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Hoyer (2002) was optimized for the best performance and A was set to Both D and a 
were initialized randomly. The values reported are those obtained for more than 0.1s of 
computation. Since the random initialization provides an objective value which is by far 
greater than the value obtained at convergence, the curves are all truncated to present 
significant objective values. All the results are obtained using the average of 3 experiments 
with different initializations. As shown on Figure 2, our algorithm provides a significant 
improvement in terms of speed compared to the other tested methods, even though the 
results for NMF and NNSC could be improved a bit using a C++ implementation. 



6.3 Sparse Principal Component Analysis 

We present here the application of our method addressing SPCA with various types of data: 
faces, natural image patches, and genomic data. 



6.3.1 Faces and Natural Patches 

In this section, we compare qualitatively the results obtained by PCA, NMF, our dictionary 
learning and our sparse principal component analysis algorithm on the data sets used in 
Section 6.2. For dictionary learning, PCA and SPCA, the input vectors are first centered 
and normalized to have a unit norm. Visual results are presented on figures 3, 4 and 5, 
respectively for the data sets D, E and F. The parameter A for dictionary learning and 
SPCA was set so that the decomposition of each input signal has approximately 10 nonzero 
coefficients. The results for SPCA are presented for various values of the parameter 7, 
yielding different levels of sparsity. The scalar r indicates the percentage of nonzero values 
of the dictionary. 

Each image is composed of k small images each representing one learned feature vector. 
Negative values are blue, positive values are red and the zero values are represented in white. 
Confirming earlier observations from Lee and Seung (2001), PCA systematically produces 
features spread out over the images, whereas NMF produces more localized features on the 
face databases D and E. However, neither PCA, nor NMF are able to learn localized features 
on the set of natural patches F. On the other hand, the dictionary learning technique is 
able to learn localized features on data set F, and SPCA is the only tested method that 
allows controlling the level of sparsity among the learned matrices. 



6.3.2 Genomic Data 

This experiment follows Witten et al. (2009) and demonstrates that our matrix decompo- 
sition technique can be used for analyzing genomic data. Gene expression measurements 
and DNA copy number changes (comparative genomic hybridization CGH) are two popular 
types of data in genomic research, which can be used to characterize a set of abnormal tissue 
samples for instance. When these two types of data are available, a recent line of research 
tries to analyze the correlation between them — that is, to determine sets of expression genes 
which are correlated with sets of chromosomal gains or losses (see Witten et al., 2009 and 
references therein). Let us suppose that for n tissue samples, we have a matrix X in R" xp 
of gene expression measurements and a matrix Y in R nx,? of CGH measurements. In order 
to analyze the correlation between these two sets of data, recent works have suggested the 
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Figure 2: Left: Comparison between our method and the approach of Lee and Seung (2001) 
for NMF. Right: Comparison between our method and the approach of Hoyer 
(2002) for NNSC. The value of the objective function is reported for three data 
sets as a function of computation time on a logarithmic scale. 
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Figure 3: Results obtained by PCA, NMF, dictionary learning, SPCA for data set D. 

28 



• .4* A A* ' tfc ' 



(a) PCA 



(b) SPCA, r = 70% 



4 i 



1 e 



V awe 



7 . > i 

r • w - *( ■n,\m r% 

l* 1 a i ; v 



•Vr 



I . • f - * ; . J \ • • 



(c) NMF 



(d) SPCA, r = 30% 



f 



i 



i « > 



4 V; ' 



-s • 



I 



(e) Dictionary Learning 
Figure 4: Results obtained by PCA, NMF, 



(f) SPCA, r = 10% 

dictionary learning, SPCA for data set E. 



29 



zi i*m: :=:«»>::< 



r ■ ■ , 
■tat 



^0 l Z 1 II ' 

=_ I* Ah = 
II - * 



j= fc 9 ^1 1^ 



(a) PCA 



(b) SPCA, r = 70% 



Cft^Z^^r tjaH""! F fir 



II - ' lis 

" _| II - 

- I- - 



i ■ 



(c) NMF 



(d) SPCA, r = 30% 



= = ir , a / = 
i r sb 

0^- A 1. 



il 



I 



I ^ . ' I- 
I" • _ I 



'I i — * 



I. I 



(e) Dictionary Learning (f) SPCA, r = 10% 

Figure 5: Results obtained by PCA, NMF, dictionary learning, SPCA for data set F. 



30 



use of canonical correlation analysis (Hotelling, 1936), which solves 9 

min cov(Xu,Yv) s.t. NXulU < 1 and NYvlU < 1. 

ueffip,veR9 

When X and Y are centered and normalized, it has been further shown that with this type 
of data, good results can be obtained by treating the covariance matrices X T X and Y T Y 
as diagonal, leading to a rank-one matrix decomposition problem 

min NX^Y — uv T ||p s.t. NulU < 1, and llvlU < 1. 

Furthermore, as shown by Witten et al. (2009), this method can benefit from sparse reg- 
ularizes such as the t\ norm for the gene expression measurements and a fused lasso for 
the CGH arrays, which are classical choices used for these data. The formulation we have 
chosen is slightly different from the one used by Witten et al. (2009) and can be addressed 
using our algorithm: 

min ||Y T X- vu T ||| + A||u|| 2 s.t. ||v||? + 7i||v||i + 72 FL(v) < 1. (17) 

In order to assess the effectivity of our method, we have conducted the same experiment 
as Witten et al. (2009) using the breast cancer data set described by Chin et al. (2006), 
consisting of q = 2, 148 gene expression measurements and p = 16, 962 CGH measurements 
for n = 89 tissue samples. The matrix decomposition problem of Eq. (17) was addressed 
once for each of the 23 chromosomes, using each time the CGH data available for the 
corresponding chromosome, and the gene expression of all genes. Following the original 
choice of Witten et al. (2009), we have selected a regularization parameter A resulting in 
about 25 non-zero coefficients in u, and selected 71 = 72 = 1, which results in sparse and 
piecewise-constant vectors v. The original matrices (X, Y) are divided into a training set 
(X tr , Ytr) formed with 3/4 of the n samples, keeping the rest (X te , Y te ) for testing. This 
experiment is repeated for 10 random splits, for each chromosome a couple of factors (u, v) 
are computed, and the correlations corr(X tr u, Y tr v) and corr(X ie u, Yi e v) are reported on 
Figure 6. The average standard deviation of the experiments results was 0.0339 for the 
training set and 0.1391 for the test set. 

Comparing with the original curves reported by Witten et al. (2009) for their penalized 
matrix decomposition (PMD) algorithm, our method exhibits in general a performance 
similar as PMD. 10 Nevertheless, the purpose of this section is more to demonstrate that our 
method can be used with genomic data than comparing it carefully with PMD. To draw 
substantial conclusions about the performance of both methods, more experiments would 
of course be needed. 

6.4 Application to Large-Scale Image Processing 

We demonstrate in this section that our algorithm can be used for a difficult large-scale im- 
age processing task, namely, removing the text (inpainting) from the damaged 12-Megapixel 

9. Note that when more than one couple of factors are needed, two sequences ui, 112, . . . and vi, V2, . . . of 
factors can be obtained recursively subject to orthogonality constraints of the sequences Xui,Xu2, . . . 
and Yvi, Yv2, .... 

10. The curves for PMD were generated with the R software package available at http://cran.r- project . 
org/web/packages/PMA/index .html and a script provided by Witten et al. (2009). 
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Genomic Experiment: Correlation Analysis 




5 10 15 20 

Chromosome 



Figure 6: SPCA was applied to the covariance matrix obtained from the breast cancer data 
(Chin et al., 2006). A fused lasso regularization is used for the CGH data. 3/4 
of the n samples are used as a training set, keeping the rest for testing. Average 
correlations from 10 random splits are reported for each of the 23 chromosomes, 
for PMD (Witten et al., 2009) and our method denoted by OL. 
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image of Figure 7. Using a multi-threaded version of our implementation, we have learned 
a dictionary with 256 elements from the roughly 7 x 10 6 undamaged 12 x 12 color patches 
in the image with two epochs in about 8 minutes on a 2.4GHz machine with eight cores. 
Once the dictionary has been learned, the text is removed using the sparse coding tech- 
nique for inpainting of Mairal et al. (2008b). Our intent here is of course not to evaluate our 
learning procedure in inpainting tasks, which would require a thorough comparison with 
state-the-art techniques on standard data sets. Instead, we just wish to demonstrate that 
it can indeed be applied to a realistic, non-trivial image processing task on a large image. 
Indeed, to the best of our knowledge, this is the first time that dictionary learning is used 
for image restoration on such large-scale data. For comparison, the dictionaries used for 
inpainting in Mairal et al. (2008b) are learned (in batch mode) on 200,000 patches only. 




Figure 7: Inpainting example on a 12-Megapixel image. Top: Damaged and restored im- 
ages. Bottom: Zooming on the damaged and restored images. Note that the 
pictures presented here have been scaled down for display. (Best seen in color). 
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7. Conclusion 



We have introduced in this paper a new stochastic online algorithm for learning dictionar- 
ies adapted to sparse coding tasks, and proven its convergence. Experiments demonstrate 
that it is significantly faster than batch alternatives such as Engan et al. (1999), Aharon 
et al. (2006) and Lee et al. (2007) on large data sets that may contain millions of train- 
ing examples, yet it does not require a careful learning rate tuning like regular stochastic 
gradient descent methods. Moreover, we have extended it to other matrix factorization 
problems such as non negative matrix factorization, and we have proposed a formulation 
for sparse principal component analysis which can be solved efficiently using our method. 
Our approach has already shown to be useful for image restoration tasks such as denoising 
(Mahal et al., 2009c); more experiments are of course needed to better assess its promise 
in bioinformatics and signal processing. Beyond this, we plan to use the proposed learning 
framework for sparse coding in computationally demanding video restoration tasks (Protter 
and Elad, 2009), with dynamic data sets whose size is not fixed, and extending this frame- 
work to different loss functions (Mahal et al., 2009b) to address discriminative tasks such 
as image classification, which are more sensitive to overfitting than reconstructive ones. 
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Appendix A. Theorems and Useful Lemmas 

We provide in this section a few theorems and lemmas from the optimization and probability 
literature, which are used in this paper. 

Theorem 1 [Corollary of Theorem 4.1 from Bonnans and Shapiro (1998), due 
to Danskin (1967)]. 

Let f : W x R q — > R. Suppose that for all x £ W the function /(x, .) is differentiable, 
and that f and V u /(x, u) the derivative o/ /(x, .) are continuous on W x W 1 . Let v(u) 
be the optimal value function v(u) = minxgc 1 /(x, u), where C is a compact subset ofW. 
Then v(u) is directionally differentiable. Furthermore, if for uq 6 M. q , /(.,uo) has a unique 
minimizer xo then v(u) is differentiable in uq and V u t>(uo) = V u /(xo,uo). 

Theorem 2 [Sufficient condition of convergence for a stochastic process, see Bot- 
tou (1998) and references therein (Metivier, 1983; Fisk, 1965)]. 

Let (SI, J- ,P) be a measurable probability space, ut, fort > 0, be the realization of a stochastic 
process and Ft be the filtration determined by the past information at time t. Let 





otherwise. 
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If for all t, u t > and Ylt=-i ^[^t( u t+i ~~ u t)\ < ^ en u t is a quasi-martingale and 
converges almost surely. Moreover, 

oo 

^ |E[u t+ i - ittlJ 7 *]! < +00 a.s. 
t=i 

Lemma 2 [A corollary of Donsker theorem see Van der Vaart, 1998, chap. 19.2, 
lemma 19.36 and example 19.7]. 

Let F = {fg : \ — > K, 9 £ 0} be a set of measurable functions indexed by a bounded subset 
O ofM. d . Suppose that there exists a constant K such that 

\f ei (x)-fe 2 (x)\<K\\9 l -9 2 \\ 2 , 

for every 9\ and 9 2 in O and x in %. Then, F is F '-Donsker (see Van der Vaart, 1998, 
chap. 19.2). For any f in F , Let us define F n f, Pf and G n f as 

1 11 

V n f = -52f(Xi), Pf = E x [f(X)}, G n / = V^(P„/-P/). 

1=1 

Let us also suppose that for all f , Pf 2 < 8 2 and ||/||oo < M and that the random elements 
Xi, X 2 , ■ ■ ■ are Borel-measurable. Then, we have 

E P ||<G n || F = 0(l), 

where ||G n ||i? = supj g ^|G n /|. For a more general variant of this lemma and additional 
explanations and examples, see Van der Vaart (1998). 

Lemma 3 [A simple lemma on positive converging sums]. 

Let a n , b n be two real sequences such that for all n, a n > 0,b n > 0, Yln=i a n = °°; 
Yju=i a nbn < 00, 3K > s.t. \b n+1 - b n \ < Ka n . Then, lim n ^ +00 b n = 0. 

Proof. The proof is similar to Bertsekas (1999, prop 1.2.4). ■ 



Appendix B. Efficient Projections Algorithms 

In this section, we address the problem of efficiently projecting a vector onto two sets of 
constraints, which allows us to extend our algorithm to various other formulations. 

B.l A Linear-time Projection on the Elastic-Net Constraint 

Let b be a vector of IR m . We consider the problem of projecting this vector onto the 
elastic-net constraint set: 

1 7 
min -lib - ull, s.t. ||u||i + -llulln < r. (18) 

To solve efficiently the case 7 > 0, we propose Algorithm 3, which extends Maculan and 
de Paula (1989) and Duchi et al. (2008), and the following lemma which shows that it solves 
our problem. 
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Lemma 4 [Projection onto the elastic-net constraint set]. 

For b in W 11 , 7 > and r > 0, Algorithm 3 solves Eq. (18). 

Proof. First, if b is a feasible point of (18), then b is a solution. We suppose therefore 
that it is not the case — that is,||b||i + 2r||b||2 > r. Let us define the Lagrangian of (18) 



£(u,A) = l||b-u||| + A(||u|| 1 + |||u|||-r). 



For a fixed A, minimizing the Lagrangian with respect to u admits a closed- form solu- 
tion u*(A), and a simple calculation shows that, for all j, 

= sign(b[j])(|b[j]|-A) + 
1 >Ul I + A7 

Eq. (18) is a convex optimization problem. Since Slater's conditions are verified and strong 
duality holds, it is equivalent to the dual problem 

max£(u*(A),A). 

A>o 

Since A = is not a solution, denoting by A* the solution, the complementary slackness 
condition implies that 

I|u*(A*)||i + |||u*(A*)||1 = t. (19) 

Using the closed form of u*(A) is possible to show that the function A —> ||u*(A)||i + 
^||u*(A)|||, is strictly decreasing with A and thus Eq. (19) is a necessary and sufficient 
condition of optimality for A. After a short calculation, one can show that this optimality 
condition is equivalent to 



E (|bb1l + ||bbH 2 -A(l + ^))=r, 



(I + A7) 2 

where 5(A) = {j s.t. \h[j]\ > A}. Suppose that 5(A*) is known, then A* can be computed 
in closed-form. To find 5(A*), it is then sufficient to find the index k such that 5(A*) = 
5( I b [A;]|), which is the solution of 



sx a + |b' WM ' ( |bb11 + l |bM|2 - |bW "' + ^) < T - 



Lines 4 to 14 of Algorithm 3 are a modification of Duchi et al. (2008) to address this prob- 
lem. A similar proof as Duchi et al. (2008) shows the convergence to the solution of this 
optimization problem in 0{m) in the average case, and lines 15 to 18 of Algorithm 3) com- 
pute A* after that 5(A*) has been identified. Note that setting 7 to leads exactly to the 
algorithm of Duchi et al. (2008). ■ 

As for the dictionary learning problem, a simple modification to Algorithm 3 allows us 
to handle the non-negative case, replacing the scalars |b[j]| by max(b[j], 0) in the algorithm. 
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Algorithm 3 Efficient projection on the elastic- net constraint. 
Require: r i R; 7 i R; b i R m ; 
if ||b||i + |||b||2 < r then 

return u ^— b. 
else 

U 4- {1, . . . , m}; s «- 0; p <- 0. 
while ?7 / do 

Pick k £ U at random. 
Partition U : 

G = {jeU s.t. |b[j]| > |b[fc]|}, 
L = {jeU s.t. |b[j]| < |b[fe]|}. 

Ap<-\G\; A s ^E, eG |b[j]| + i|b[j]| 2 . 

if s + As - (p + Ap)(l + l\b[k]\)\b[k}\ < r(l + 7 |b[fc]|) 2 then 

s <r- s + As; p «- Ap; U <- L. 
else 

U<-G\{k}. 
end if 



end while 

a <- 7 2 r + 
& <- 2jt + p, 
c < 
A 



c r — s, 



2a 



20: return u. 
21: end if 



I + A7 
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B.2 A Homotopy Method for Solving the Fused Lasso Signal Approximation 

Let b be a vector of M. m . We define, following Friedman et al. (2007), the fused lasso signal 
approximation problem V (71, 72, 73): 

min -||b-u||2 + 7i||u|| 1 +7 2 FL(u) + 5Hl| ) (20) 

the only difference with Friedman et al. (2007) being the addition of the last quadratic term. 
The method we propose to this problem is a homotopy, which solves P(tj\, T72, T73) for 
all possible values of r. In particular, for all e, it provides the solution of the constrained 
problem 

min-llb-uHo s.t. 7i||u||i + 72 FL(u) + ^Null 2 , < e. (21) 

The algorithm relies on the following lemma 

Lemma 5 Let ^(71,72,73) be the solution of Eq. (20), for specific values 0/71,72,73. 
Then 

• u*(7i, 72, 73) = iq^u*(7i,72,0). 

• For alii, u*(7i, 72, 0)[i] = sign(u*(0, 72, 0)[i]) max(|u*(0, 72, 0)[i]\ -Ai,0)— that is, 
u* (71, 72,0) can be obtained by soft thresholding of u*(0, 72, 0). 

The first point can be shown by short calculation. The second one is proven in Friedman 
et al. (2007) by considering subgradient optimality conditions. This lemma shows that if 
one knows the solution of 7- , (0,72,0), then 7^(71,72,73) can be obtained in linear time. 
It is therefore natural to consider the simplified problem 

min -lib - ullo +7 2 FL(u). (22) 

With the change of variable v[l] = u[l] and v[i] = u[i] — u[i — 1] for % > 1, this problem 
can be recast as a weighted Lasso 

^ m 

v^2 l|b - Dv|| 2 + E^l v WI' ( 23 ) 

i=l 

where w\ = and Wi = 72 for i > 1, and D[i,j] = 1 if i > j and otherwise. We propose 
to use LARS (Efron et al., 2004) and exploit the specific structure of the matrix D to make 
this approach efficient, by noticing that: 

• For a vector w in IR m , computing e = Dw requires 0(m) operations instead of 0(m 2 ), 
by using the recursive formula e[l] = w[l], e[i + 1] = w[i] + e[i]. 

• For a vector w in W\ computing e = D T w requires 0(m) operations instead of 
0(m 2 ), by using the recursive formula e[n] = w[n], e[i — 1] = w[i — 1] + e[i]. 



38 



Let r = {ai, . . . ,a p } be an active set and suppose a\ < 
admits the closed form value 



< op. Then (D^Dr) 



(D^Dr) 1 = 



Ci -ci 
-Cl Cl + c 2 
-c 2 



where c„ 



i 



n+1— % 





V o 

and d - 








Oj+i— a, 





-C2 
C2 + C 3 




for i < p. 









-2 + Cp-l —Cp-l 

~ c p— 1 Cp_l + Cp J 



This allows the implementation of this homotopy method without using matrix inversion 
or Cholesky factorization, solving Eq. (23) in 0{ms) operations, where s is the number of 
non-zero values of the optimal solution v. 11 

Adapting this method for solving Eq. (21) requires following the regularization path of 
the problems 'P(0,T72,0) for all values of r, which provides as well the regularization path 
of the problem V(t\\, t\2, rA 3 ) and stops whenever the constraint becomes unsatisfied. 
This procedure still requires 0(ms) operations. 

Note that in the case 71 = and 73 = 0, when only the fused-lasso term is present in 
Eq (20), the same approach has been proposed in a previous work by Harchaoui and Levy- 
Leduc (2008), and Harchaoui (2008) to solve Eq. (22), with the same tricks for improving 
the efficiency of the procedure. 
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